The Slovene BNSI Broadcast News database and reference speech corpus GOS: Towards the uniform guidelines for future work

نویسندگان

  • Andrej Zgank
  • Ana Zwitter Vitez
  • Darinka Verdonik
چکیده

The aim of the paper is to search for common guidelines for the future development of speech databases for less resourced languages in order to make them the most useful for both main fields of their use, linguistic research and speech technologies. We compare two standards for creating speech databases, one followed when developing the Slovene speech database for automatic speech recognition – BNSI Broadcast News, the other followed when developing the Slovene reference speech corpus GOS, and outline possible common guidelines for future work. We also present an add-on for the GOS corpus, which enables its usage for automatic speech recognition.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

BNSI Slovenian broadcast news database - speech and text corpus

This paper presents the BNSI Slovenian Broadcast News database project. The result of the project is a database with speech and text corpus oriented toward large vocabulary continuous speech recognition in general domain. The speech corpus consists of 36 hours of transcribed evening and late night news. The raw database material was captured in the archive of national broadcaster RTV Slovenia t...

متن کامل

SINOD - Slovenian non-native speech database

This paper presents the SINOD database, which is the first Slovenian non-native speech database. It will be used to improve the performance of large vocabulary continuous speech recogniser for non-native speakers. The main quality impact is expected for acoustic models and recogniser’s vocabulary. The SINOD database is designed as supplement to the Slovenian BNSI Broadcast News database. The sa...

متن کامل

The goo300k corpus of historical Slovene

The paper presents a gold-standard reference corpus of historical Slovene containing 1,000 sampled pages from over 80 texts, which were, for the most part, written between 1750 – 1900. Each page of the transcription has an associated facsimile and the words in the texts have been manually annotated with their modern-day equivalent, lemma and part-of-speech. The paper presents the structure of t...

متن کامل

RUNDKAST: an Annotated Norwegian Broadcast News Speech Corpus

This paper describes the Norwegian broadcast news speech corpus RUNDKAST. The corpus contains recordings of approximately 77 hours of broadcast news shows from the Norwegian broadcasting company NRK. The corpus covers both read and spontaneous speech as well as spontaneous dialogues and multipart discussions, including frequent occurrences of non-speech material (e.g. music, jingles). The recor...

متن کامل

TUKE-BNews-SK: Slovak Broadcast News Corpus Construction and Evaluation

This article presents an overview of the existing acoustical corpuses suitable for broadcast news automatic transcription task in the Slovak language. The TUKE-BNews-SK database created in our department was built to support the application development for automatic broadcast news processing and spontaneous speech recognition of the Slovak language. The audio corpus is composed of 479 Slovak TV...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014